Background, Overview, and Motivation:

There is no community that feels the convergence of government policies and in-home culture more than women and children. We are interested in studying the population of mothers and infants due to their unique vulnerability, and their health outcomes which are disproportionately affected by various exposures and government policies. Our group members had interests in both environmental health and perinatal health, so we decided to assess the effect of environmental exposures on maternal and infant health outcomes. We choose to restrict our analysis of these exposures and outcomes to only California due to both data availability constraints, the state’s population and racial diversity, its comprehensive and transparent environmental regulations, and its high use of pesticides. California is the greatest user of pesticides in the US with over 85 million kg applied annually, an amount equivalent to roughly 30% of the cumulative active ingredients applied to US agriculture.

In 1986, California passed “The Safe Drinking Water and Toxic Enforcement Act of 1986” also known as Proposition 65. Proposition 65 requires businesses to provide warnings regarding significant exposures to chemicals that cause cancer, birth defects or other reproductive harm. By requiring that this information be provided, Proposition 65 enables Californians to make informed decisions about their exposures to these chemicals. California has a list of harmful chemicals as characterized by Proposition 65, which is updated at least once a year and includes over 900 chemicals. Proposition 65 has motivated businesses to eliminate or reduce toxic chemicals in numerous consumer products and has led to the safer reformulation of many products. The law has also been successful in educating the general public about exposures to toxic chemicals in consumer products, buildings, and the environment, which as a result created a demand and market reward for less-toxic products.

California is the most populous state in the United States (with roughly 12% of annual births) and is a very diverse state in regards to both demographics and landscape across its various counties, which lends to increased diversity in the state level data. Due to California’s diverse population, we were able to assess exposure to racism (recorded as race) as a potential confounder to maternal health outcomes. The inclusion of analysis and a discussion surrounding race and racism is critical when doing any research in the field of maternal health. The literature has no shortage of evidence pointing to disproportionately adverse health outcomes for Black mothers and babies in America. Infant mortality rates for America’s Black babies are more than twice the rate of white babies and they are more than three times as likely to die from complications related to low birth weight. U.S. maternal mortality rates for Black women are also three to four times higher than rates for white women. For this reason we decided to classify “the experience of racism” as a confounder in our analysis and to stratify by race to account for this confounding.

Initial Questions & Research Question Evolution:

Our initial question was pretty broad: “What is the effect of pesticide use on Maternal and Child Health?” We then narrowed down our scope to include only data from California (inspired by our background research and related work). Over the course of the project, we defined our exposure as pesticide use (continuous variable measured in pounds). We also narrowed down our outcomes of interest to include: fertility, birth weight, and gestational age. We further narrowed our scope of counties to focus on those with high pesticide use and high agricultural activities due to the focus on these areas in the literature we reviewed (mentioned in the “related work” section). The counties we focused on were ranked in our data as the top 4 counties for highest pesticide use and they included: Fresno, Kern, Tulare, and San Joaquin (in that order). We chose to include Los Angeles as a comparison group for the exploratory analysis of the maternal and child health data because it is one of the most populated and most diverse counties in California, and this was of importance to us due to our interest in examining the maternal and infant health outcomes, stratified by race.

Data Sources:

*See notes regarding scraping, cleaning, and wrangling methods at each respective code chunk

Data Source for Pesticides:

Pesticide use for California counties data was retrieved from the California Department of Pesticide Regulation- Pesticide use reporting program https://www.cdpr.ca.gov/docs/pur/purmain.htm

Maternal and Child Health Data Source(s):

Maternal and Child Health outcome data was mainly obtained from the following three sources:

1, California Open Data Portal https://data.ca.gov/dataset/live-births-with-low-and-very-low-birthweight

  1. CHHS Open Data https://data.chhs.ca.gov/dataset/preterm-and-very-preterm-live-births/resource/cff79e2d-6ecf-4158-9e4f-7078632220ee

  2. Centers for Disease Control and Prevention (CDC) Natality Online Database on the Wide-ranging OnLine Data for Epidemiologic Research (WONDER) system Natality, 2007-2019 Request Centers for Disease Control and Prevention (CDC) Natality Online Database on the Wide-ranging OnLine Data for Epidemiologic Research (WONDER) system https://wonder.cdc.gov/natality-current.html

Exploratory Analysis:

Exploratory Analysis via GGPlots (Lara):

For each of the variables of interest (Fertility Rate, Gestational Age, and Birth Weight) we first visualized the variable across all counties over the years in a tile plot. We then viewed the trend in our counties of interest by filtering by county and plotting the variable on a line plot. These counties include Fresno, Kern, Tulare, San Joaquin (since they are continuously ranked as the top 4 in highest use of pesticides), and Los Angeles (as a control/comparison county) since Los Angeles is one of the most populated and most diverse counties in California. Fresno, Kern, Tulare, and San Joaquin counties are also all a part of San Joaquin Valley which we mentioned in our background and related work section to be of special interest to us because it is California’s most productive agricultural region and has one of the highest amounts of pesticide use. Since confounding by race was of interest to us, we also analyzed the racial demographics across counties by visualizing the total population of each race across all counties in separate tile plots by race. We also visualized the racial demographics data in our counties of interest (Fresno, Kern, Tulare, San Joaquin and Los Angeles). After assessing the racial demographics, we continued to view the trends of our variables of interest (Fertility Rate, Gestational Age, and Birth Weight) across counties, stratified by race. We did this by visualizing each variable of interest for one specific race in a tile plot and repeated this for each race (so that each race had a tile plot of the variable over the years across all counties). We also plotted the variables, stratified by race, in all our counties of interest (Fresno, Kern, Tulare, San Joaquin and Los Angeles) to identify any trends and outcome variables that vary across different races.

Exploratory Analysis via Shiny App (Sonia):

We first created an interactive bar graph, allowing the users to visualize a pattern of low birth weight (<2500 g) and preterm birth (defined as <37 weeks LMP) across a span of years (2007-2016). The database is limited in that it does not show every county-level data for confidentiality reasons, suppressing values for counties that had a population of <100,000. We also created a Shiny map (using a leaflet function) to visualize a general pattern in pre-term births and low birth weight across the state of California. We initially used the WONDER CDC database and filtered it to year 2016. However, since the CDC database system excluded a great # of counties (population <100,000), the map did not come out informative as we had expected. Thus, we consulted additional data sources for preterm birth and low birth weight (California Open Data Portal; CHHS Data Open Portal). Using the leaflet map function, we were able to recognize a general pattern of prevalences in pre-term births and low birth weight across the state of California. However, we also saw that a relatively high rates in adverse perinatal outcomes occurred in the central region of the state.

Regression Analysis (Zainab):

We were only able to retrieve pesticide data from the California Department of Pesticide Regulation up to 2016, so the regression analysis was restricted to that year. Counties were categorized based on rank of the top ten agricultural counties, according to the California Department of Food and Agriculture Production Statistics. These counties, in no specific order, were Kern, Tulare, Fresno, Monterey, Merced, Stanislaus, San Joaquin, Imperial, Ventura, Kings county. Most of these counties are located in the San Joaquin Valley and all were highly ranked in terms of pesticide usage, according to the California Department of Pesticide Regulation.

Initially, we considered logistic regression. We were interested in seeing if the different perinatal outcomes (birthweight and gestational age at birth) could predict the odds of being in a top ten agricultural county. Unfortunately, the logistic models did not appear to model the data well, as there were counties that had very small populations. Because they had small populations, they also had few births, and it was possible that no babies were born with low-birth weight or preterm. These appeared as outliers in all the scatterplots, and we felt that excluding those counties was not appropriate. So, instead of looking at the top ten ranking as an outcome, we decided to look at it as a predictor of the different birth outcomes. Linear regression was the next option for regression analysis, since the count data could easily be transformed into rates. Poisson regression was not used because there was difficulty in determining an appropriate offset.

Final Analysis:

Exploratory Analysis via GGPlots (Lara):

From the exploratory data analysis using ggplots, we learned that Tulare, Fresno, Kings, Kern, and Imperial Counties (all counties with high agricultural and pesticide use rankings) had the highest fertility rates, and had higher fertility rates than Los Angeles (our comparison county). One reason for this could be the difference in access to and attitudes regarding family planning and contraception in these rural, agricultural counties. However we observed that all counties had a decreasing fertility rate over the years which matches the global and national average trends. Before stratifying by race we were curious to see if there was a disproportionate amount of any one race in any of the counties, and we found that no specific county stood out for having a disproportionate amount of any one race, except Los Angeles county which appeared to be the country where each race had the highest total population (as expected since we know that Los Angeles is one of the most populated and most diverse counties in california. To assess exposure to racism (measured as race) as a potential confounder went on to assess all our outcomes of interest stratified by race.

We first assessed fertility rates stratified by race and found that White and Asian or Pacific Islander Populations had the highest fertility rates across all counties, compared to Black or African American and American Indian or Alaska Native with American Indian or Alaska Native populations having the lowest fertility rates across all counties.In all our counties of interest, White populations had the highest fertility rates followed by Asian or Pacific Islander Populations in Fresno County and Tulare (Ranked #1 and #3 for Pesticide Use respectively) and Black or African American Populations in Kern County and San Joaquin County (Ranked #2 and #4 for Pesticide Use respectively).

We also analyzed gestational age at birth (measured in weeks via last menstrual period method) and stratified by race. We found that in the aggregate data, most counties seemed to have relatively high average gestational age at birth, but Fresno seemed to have a slightly lower average gestational age at birth compared to other counties, and since Fresno is ranked #1 in pesticide use, this may be cause to further explore a potential relationship between pesticide exposure and low gestational age although since the data is so similar across counties, there is a high probability that this is not a statistically significant difference. When stratifying by race we found, as expected, that Black or African American populations experienced more variability and lower gestational ages across counties. In all our counties of interest, we saw that Black or African American populations had the lowest gestational age at birth compared to other races, a national trend also found in the literature. However we also observed that none of the counties of interest had average gestational ages which equaled or were more extreme than the preterm birth cutoff of 37 weeks.

We also analyzed birth weight (measured in grams) and stratified our analysis by race. We found that in the aggregate data, most counties seemed to have relatively high average birth weights and were very similar across the years, but Fresno seemed to have a slightly lower average birth weight at birth compared to other counties, and since Fresno is ranked #1 in pesticide use, this may be cause to further explore a potential relationship between pesticide exposure and low birthweight although since the data is so similar across counties, there is a high probability that this is not a statistically significant difference. When stratifying by race we found, as expected, that Black or African American populations experienced more variability and lower birth weight across counties. In all our counties of interest, we saw that Black or African American populations had the lowest birth weights compared to other races, a national trend also found in the literature. However we also observed that none of the counties of interest had average birth weights which equaled or were more extreme than the low birthweight cutoff of 2500 grams.

In conclusion, we found that our counties of interest (i.e. counties with high pesticide use) had higher fertility rates on average than counties without high pesticide use. The relationship between exposure to racism (measured as race) and adverse perinatal health outcomes was apparent in this data as we observed lower birth weights and gestational ages for Black and African American Populations compared to other races. However there was no evidence in the initial data exploration to support a statistically significant conclusion that our counties of interest (i.e. counties with high pesticide use) had adverse perinatal health outcomes compared to counties without high pesticide use.

Exploratory Analysis via Shiny App (Sonia):

Using the same Shiny leaflet function, we tried to see if this region was associated with any use of pesticides. Note that California has a broad legal definition of pesticide: it includes pesticides applied in agriculture, parks, golf courses, cemeteries, etc. As can seen from the shiny app, we noticed that counties that sit around the central region (also known as San Joaquin Valley) of California have consistently been the highest pesticide users from 2007-2016. In fact, San Joaquin Valley is known to be the largest agricultural producer in the world! Naturally, our next question was whether high pesticide use in this particular region was associated with adverse perinatal outcomes. We speculated a possible correlation, and thought it would be interesting to investigate the relationship at a closer-look, using regression analysis.

Regression Analysis (Zainab):

We were able to create a linear regression model to determine the association between top ten ranking, average gestational age, and average birth weight by including an interaction term. The model is described in the following equation:

BW = -6880.061 + 262.434 [AGE] + 5256.662 [I(TOP)] - 136.464 [AGE]x[I(TOP)],

Where BW is the average birth weight, AGE is the average LMP gestational age, and I(TOP) is an indicator variable for being a top ten agricultural county. According to the adjusted R-squared value, the model explains 46% of the variability in the data. Overall there is a positive relationship between gestational age and birth weight, as expected, but the trend is less pronounced amongst mothers living in top agricultural counties. For two babies that are the same gestational age, the baby born in a top ranked agricultural county will have a smaller birth weight, on average. This suggests that living in a highly agricultural environment, an environment with a lot of pesticide usage, has an adverse effect on birth weight.

We were also interested in how this trend varies by race. According to the data, Asian, Pacific Islander, and Black mothers tend to have babies that weigh less at birth than American Indian, Alaska Native, or White mothers. We were able to successfully develop an appropriate linear model for all categories except for Asian and Pacific Islander mothers. The models for American Indian and Alaska Native, Black, and White mothers are: BW = -3829.713 + 184.977 [AGE], BW = -5672.432 + 230.120 [AGE] + 5451.896 [I(TOP)] - 142.404 [AGE]x[I(TOP)], and BW = -5250.834 + 221.394 [AGE] + 6590.076 [I(TOP)] - 170.141 [AGE]x[I(TOP)],

respectively.

The population of American Indian and Alaska Native people was small in every county, so they will be excluded from the following comparison.

The difference between babies born in high ranked agricultural counties and those not born in high ranked agricultural counties is smaller among those born to Black mothers as compared to those born to White mothers. However, babies born to White mothers tend to be older (in terms of gestational age), and have higher birth weights.

Imperial County appeared as an influential point in most of the linear models. Imperial did not appear to have extreme values in any variable category, and the outliers involved low birth weights for high gestational ages in Asian and Pacific Islander and Black mothers. We already determined that these populations have babies with low birth weights on average, so it was not appropriate to exclude the points from our analysis.

Load Packages

library(dplyr)
library(tidyverse)
library(pdftools)
library(readr)
library(stringr)
library(ggthemes)
library(shiny)
library(shinyBS)
library(RColorBrewer)
library(shinydashboard)
library(sp)
library(rgeos)
library(rgdal)
library(maptools)
library(leaflet)
library(scales)
library(maps)
library(gridExtra)
library(grid)

MCH Data Wrangling *Low BW Only (for map)

We wanted to create a map to visualize the pattern of low birth weight in California. I initially used CDC WONDER database but it had very limited information, where it grouped rural and small county-level data to “unidentified counties.” Thus, I consulted another data source from California Open Data Portal. Following is the data cleaning process:

#Data Wrangling for Year 2014-2018 Data for Map.  
lbwdata<-read.csv("./low-and-very-low-birthweight-by-county-2014-2018 (1).csv", header = TRUE, stringsAsFactors = FALSE)
lbwdata <- lbwdata %>% mutate(County = str_to_title(County))
lbwdata$Events[is.na(lbwdata$Events)] <- 0
lbwdata <- lbwdata %>% group_by(Year, County, Total.Births) %>% summarize(Events = sum(Events)) 
## `summarise()` regrouping output by 'Year', 'County' (override with `.groups` argument)
lbwdata <- lbwdata %>% filter(!County == "california") 
lbwdata <- lbwdata %>% mutate(Rate = Events/Total.Births)

MCH Data Wrangling *Preterm Birth Only (for map)

We also wanted to create another map to visualize the pattern of preterm birth in California. Again, I ran into a similar problem using CDC WONDER database. Thus, I consulted the CHHS database Following is the data cleaning process:

ptbirthdata<- read.csv("preterm-and-very-preterm-births-by-county-2010-2018-3.csv", header = TRUE, stringsAsFactors = FALSE)
ptbirthdata$Events[is.na(ptbirthdata$Events)] <- 0
ptbirthdata <- ptbirthdata[,-c(7,8)]
ptbirthdata <- ptbirthdata %>% group_by(Year, County, Total.Births) %>% summarize(Events = sum(Events)) 
## `summarise()` regrouping output by 'Year', 'County' (override with `.groups` argument)
ptbirthdata <- ptbirthdata %>% filter(!County == "california")#removing the total count
ptbirthdata <- ptbirthdata %>% mutate(rate_pt = Events/Total.Births * 100)

Code for Creating the Map - Leaflet Map Using LBW Data (See Shiny App for the Final Result)

After cleaning the data set, I then looked at creating a “spatial” map. Following is the wrangling process of generating a map that shapes the map of California, and merging that spatial object with low birth weight data that I wrangled earlier to generate a leaflet map. My main motivation of using a leaflet map was because I wanted to create a map where the user can see which county is which and is able to zoom in and out. Note that there are counties that had NA cases (perhaps for counties that had a very small population).

map <- readOGR(path.expand("cb_2018_us_county_20m.shp"),
               layer = "cb_2018_us_county_20m", stringsAsFactors = FALSE)
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/lararostomian/Desktop/Harvard/Classes/BST 260/datascience-project/Data Prep (& Final RMD)/cb_2018_us_county_20m.shp", layer: "cb_2018_us_county_20m"
## with 3220 features
## It has 9 fields
## Integer64 fields read as strings:  ALAND AWATER
Statekey<-read.csv('./STATEFPtoSTATENAME_Key.csv', colClasses=c('character'))
map<-merge(x=map, y=Statekey, by="STATEFP", all=TRUE)
SingleState <- subset(map, map$STATENAME %in% c(
    "California"
))

lbwdata_2016 <- lbwdata %>% filter(Year == "2016") %>% mutate(Rate = Events/Total.Births*100)
spatial_lbw <-sp::merge(x=SingleState, y=lbwdata_2016, by.x="NAME", by.y="County", by=x)

bins <- c(4.0,6.3,7.6,8.1, Inf)
pal <- colorBin(
    palette = "viridis",
    domain = spatial_lbw$Rate, n=7, bins=bins)

leaflet(spatial_lbw, options = leafletOptions(zoomControl = TRUE, zoomLevelFixed = FALSE, dragging=TRUE, minZoom = 5.3, maxZoom = 9)) %>% 
            setView(lat = 36.778259, lng = -119.417931, zoom = 6) %>%
            addPolygons(color = "Black", weight = 1, smoothFactor = 0.5, 
                        opacity = 1.0, fillOpacity = 0.5, layerId = ~NAME,
                        fillColor = ~pal(Rate), 
                        popup = ~as.factor(paste0("<b><font size=\"4\"><center>County: </b>",spatial_lbw$NAME,"</font></center>","<b>% of Low Birth Weight Births: </b>", sprintf("%1.2f%%", spatial_lbw$Rate),"<br/>"))) %>%
            addLegend(pal = pal, values = spatial_lbw$Rate, opacity = 1, title="% Low Birth Weight (Quartiles)")
## Warning in pal(Rate): Some values were outside the color scale and will be
## treated as NA

Code for Creating the Map - Leaflet Map Using Preterm Birth Data Wrangled Earlier (See Shiny App for the Final Result)

This is a similar spatial map but for preterm birth. Following is the wrangling process of generating a map that shapes the map of California, and merging that spatial object with pre-term birth data that I wrangled earlier to generate a leaflet map. Note that there are counties that had NA cases (perhaps for counties that had a very small population).

map <- readOGR(path.expand("cb_2018_us_county_20m.shp"),
               layer = "cb_2018_us_county_20m", stringsAsFactors = FALSE)
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/lararostomian/Desktop/Harvard/Classes/BST 260/datascience-project/Data Prep (& Final RMD)/cb_2018_us_county_20m.shp", layer: "cb_2018_us_county_20m"
## with 3220 features
## It has 9 fields
## Integer64 fields read as strings:  ALAND AWATER
Statekey<-read.csv('./STATEFPtoSTATENAME_Key.csv', colClasses=c('character'))
map<-merge(x=map, y=Statekey, by="STATEFP", all=TRUE)
SingleState <- subset(map, map$STATENAME %in% c(
    "California"
))

ptbirthdata_2016 <- ptbirthdata %>% filter(Year == "2016") 
spatial_pt <-sp::merge(x=SingleState, y=ptbirthdata_2016, by.x="NAME", by.y="County", by=x)

bin <- c(5.5, 8.2, 9.1, 9.9, Inf)
pal2 <- colorBin(
    palette = "plasma",
    domain = spatial_pt$rate_pt, n=7, bins=bin)

leaflet(spatial_pt, options = leafletOptions(zoomControl = TRUE, zoomLevelFixed = FALSE, dragging=TRUE, minZoom = 5.3, maxZoom = 9)) %>% 
                    setView(lat = 36.778259, lng = -119.417931, zoom = 6) %>%
                    addPolygons(color = "Black", weight = 1, smoothFactor = 0.5, 
                                opacity = 1.0, fillOpacity = 0.5, layerId = ~NAME,
                                fillColor = ~pal2(rate_pt), 
                                popup = ~as.factor(paste0("<b><font size=\"4\"><center>County: </b>",spatial_pt$NAME,"</font></center>","<b>% of Preterm Birth: </b>", sprintf("%1.2f%%", spatial_pt$rate_pt),"<br/>"))) %>% addLegend(pal = pal2, values = spatial_pt$rate_pt, opacity = 1, title="% Preterm Birth (Quartiles)")
## Warning in pal2(rate_pt): Some values were outside the color scale and will be
## treated as NA

Maternal and Child Health Indicators Data Wrangling of CDC Data

The data sets used for the exploratory analysis of MCH indicators by county in California were downloaded from the CDC Wonder Database in “.txt” format. I read the .txt files into the rmd file and turned them into data frames. The first data frame, MCH.CDC.Data had Maternal and Infant Health Outcomes by county over the years, the MCH.CDC.Data_Race had the same variables as the MCH.CDC.Data frame but was stratified by Mother’s Race. I also renamed all the counties in these two data frames to match the same names (re: case and format) as the counties in the pesticide data frames for easier comparison of the variables in these two data frames when comparing by county. Below is the data wrangling and cleaning code for the Maternal and Child Health Data from the CDC Wonder Source.

#Data Wrangling for CDC Data (COMPLETE)
MCH.CDC.Data <- read.delim("NatalityTOTAL.txt",  sep ="\t", dec=".", header = TRUE, stringsAsFactors = FALSE)
MCH.CDC.Data <- MCH.CDC.Data[-c(491:585), ]
MCH.CDC.Data <- MCH.CDC.Data %>% filter(Notes != "Total")
MCH.CDC.Data <- MCH.CDC.Data[ ,-c(1,3,5,7,9)]

MCH.CDC.Data_Race <- read.delim("NatalityRACE.txt",  sep ="\t", dec=".", header = TRUE, stringsAsFactors = FALSE)
MCH.CDC.Data_Race <- MCH.CDC.Data_Race[-c(1794:1931), ]
MCH.CDC.Data_Race <- MCH.CDC.Data_Race[ ,-c(1,3,5,7,9,11)]

#Rename Counties to Match Pesticide Data
MCH.CDC.Data[MCH.CDC.Data$County == "Alameda County, CA", "County"] <-"Alameda"
MCH.CDC.Data[MCH.CDC.Data$County == "Butte County, CA", "County"] <-"Butte"
MCH.CDC.Data[MCH.CDC.Data$County == "Contra Costa County, CA", "County"] <-"Contra Costa"
MCH.CDC.Data[MCH.CDC.Data$County == "El Dorado County, CA", "County"] <-"El Dorado"
MCH.CDC.Data[MCH.CDC.Data$County == "Fresno County, CA", "County"] <-"Fresno"
MCH.CDC.Data[MCH.CDC.Data$County == "Humboldt County, CA", "County"] <-"Humboldt"
MCH.CDC.Data[MCH.CDC.Data$County == "Imperial County, CA", "County"] <-"Imperial"
MCH.CDC.Data[MCH.CDC.Data$County == "Kern County, CA", "County"] <-"Kern"
MCH.CDC.Data[MCH.CDC.Data$County == "Kings County, CA", "County"] <-"Kings"
MCH.CDC.Data[MCH.CDC.Data$County == "Los Angeles County, CA", "County"] <-"Los Angeles"
MCH.CDC.Data[MCH.CDC.Data$County == "Madera County, CA", "County"] <-"Madera"
MCH.CDC.Data[MCH.CDC.Data$County == "Marin County, CA", "County"] <-"Marin"
MCH.CDC.Data[MCH.CDC.Data$County == "Contra Costa County, CA", "County"] <-"Mariposa"
MCH.CDC.Data[MCH.CDC.Data$County == "Merced County, CA", "County"] <-"Merced"
MCH.CDC.Data[MCH.CDC.Data$County == "Monterey County, CA", "County"] <-"Monterey"
MCH.CDC.Data[MCH.CDC.Data$County == "Napa County, CA", "County"] <-"Napa"
MCH.CDC.Data[MCH.CDC.Data$County == "Orange County, CA", "County"] <-"Orange"
MCH.CDC.Data[MCH.CDC.Data$County == "Placer County, CA", "County"] <-"Placer"
MCH.CDC.Data[MCH.CDC.Data$County == "Riverside County, CA", "County"] <-"Riverside"
MCH.CDC.Data[MCH.CDC.Data$County == "Sacramento County, CA", "County"] <-"Sacramento"
MCH.CDC.Data[MCH.CDC.Data$County == "San Bernardino County, CA", "County"] <-"San Bernardino"
MCH.CDC.Data[MCH.CDC.Data$County == "San Diego County, CA", "County"] <-"San Diego"
MCH.CDC.Data[MCH.CDC.Data$County == "San Francisco County, CA", "County"] <-"San Francisco"
MCH.CDC.Data[MCH.CDC.Data$County == "San Joaquin County, CA", "County"] <-"San Joaquin"
MCH.CDC.Data[MCH.CDC.Data$County == "San Luis Obispo County, CA", "County"] <-"San Luis Obispo"
MCH.CDC.Data[MCH.CDC.Data$County == "San Mateo County, CA", "County"] <-"San Mateo"
MCH.CDC.Data[MCH.CDC.Data$County == "Santa Barbara County, CA", "County"] <-"Santa Barbara"
MCH.CDC.Data[MCH.CDC.Data$County == "Santa Clara County, CA", "County"] <-"Canta Clara"
MCH.CDC.Data[MCH.CDC.Data$County == "Santa Cruz County, CA", "County"] <-"Santa Cruz"
MCH.CDC.Data[MCH.CDC.Data$County == "Shasta County, CA", "County"] <-"Shasta"
MCH.CDC.Data[MCH.CDC.Data$County == "Solano County, CA", "County"] <-"Solano"
MCH.CDC.Data[MCH.CDC.Data$County == "Sonoma County, CA", "County"] <-"Sonoma"
MCH.CDC.Data[MCH.CDC.Data$County == "Stanislaus County, CA", "County"] <-"Stanislaus"
MCH.CDC.Data[MCH.CDC.Data$County == "Tulare County, CA", "County"] <-"Tulare"
MCH.CDC.Data[MCH.CDC.Data$County == "Ventura County, CA", "County"] <-"Ventura"
MCH.CDC.Data[MCH.CDC.Data$County == "Yolo County, CA", "County"] <-"Yolo"

MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Alameda County, CA", "County"] <-"Alameda"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Butte County, CA", "County"] <-"Butte"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Contra Costa County, CA", "County"] <-"Contra Costa"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "El Dorado County, CA", "County"] <-"El Dorado"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Fresno County, CA", "County"] <-"Fresno"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Humboldt County, CA", "County"] <-"Humboldt"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Imperial County, CA", "County"] <-"Imperial"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Kern County, CA", "County"] <-"Kern"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Kings County, CA", "County"] <-"Kings"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Los Angeles County, CA", "County"] <-"Los Angeles"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Madera County, CA", "County"] <-"Madera"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Marin County, CA", "County"] <-"Marin"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Contra Costa County, CA", "County"] <-"Mariposa"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Merced County, CA", "County"] <-"Merced"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Monterey County, CA", "County"] <-"Monterey"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Napa County, CA", "County"] <-"Napa"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Orange County, CA", "County"] <-"Orange"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Placer County, CA", "County"] <-"Placer"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Riverside County, CA", "County"] <-"Riverside"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Sacramento County, CA", "County"] <-"Sacramento"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "San Bernardino County, CA", "County"] <-"San Bernardino"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "San Diego County, CA", "County"] <-"San Diego"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "San Francisco County, CA", "County"] <-"San Francisco"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "San Joaquin County, CA", "County"] <-"San Joaquin"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "San Luis Obispo County, CA", "County"] <-"San Luis Obispo"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "San Mateo County, CA", "County"] <-"San Mateo"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Santa Barbara County, CA", "County"] <-"Santa Barbara"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Santa Clara County, CA", "County"] <-"Santa Clara"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Santa Cruz County, CA", "County"] <-"Santa Cruz"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Shasta County, CA", "County"] <-"Shasta"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Solano County, CA", "County"] <-"Solano"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Sonoma County, CA", "County"] <-"Sonoma"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Stanislaus County, CA", "County"] <-"Stanislaus"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Tulare County, CA", "County"] <-"Tulare"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Ventura County, CA", "County"] <-"Ventura"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Yolo County, CA", "County"] <-"Yolo"

MCH.CDC.Data_Race<- MCH.CDC.Data_Race %>% rename("Mothers.Race" = "Mother.s.Bridged.Race")

Data Key:

Exploratory Data Analysis of Maternal and Infant Health Indicators from the CDC Wonder Database:

Fertility Rate

The first variable I examined was fertility rate and I first visualized the fertility rates across all counties over the years in a tile plot. I then viewed the trend in our counties of interest which include Fresno, Kern, Tulare, San Joaquin (since they are continuously ranked as the top 4 in highest use of pesticides), and Los Angeles (as a control/comparison county) since LA is one of the most populated and most diverse counties in California. Fresno, Kern, Tulare, and San Joaquin counties are also all a part of San Joaquin Valley which we mentioned in our background and related work section to be of special interest to us because it is California’s most productive agricultural region and has one of the highest amounts of pesticide use.

#Fertility Rate GGPLOTS
MCH.CDC.Data %>% 
    ggplot(aes(x = Year, y = County,  fill = Fertility.Rate)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Fertility Rate", limits = c(30,100),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Fertility Rate by County") + 
    ylab("") + xlab("")

#Fresno, Ranked #1 in Pesticide Use
MCH.CDC.Data %>% group_by(County) %>% filter(County == "Fresno") %>% ggplot(aes(Year, Fertility.Rate, color = County)) + geom_line() + labs( y="Fertility Rate", title = "Fertility Rate in Fresno County", subtitle = "2007-2019", color = "County", caption = "Data Source: CDC WONDER Online Database") + ylim(30,100)

#Kern, Ranked #2 in Pesticide Use
MCH.CDC.Data %>% group_by(County) %>% filter(County == "Kern") %>% ggplot(aes(Year, Fertility.Rate, color = County)) + geom_line() + labs( y="Fertility Rate", title = "Fertility Rate in Kern County", subtitle = "2007-2019", color = "County", caption = "Data Source: CDC WONDER Online Database") + ylim(30,100)

#Tulare, Ranked #3 in Pesticide Use
MCH.CDC.Data %>% group_by(County) %>% filter(County == "Tulare") %>% ggplot(aes(Year, Fertility.Rate, color = County)) + geom_line() + labs( y="Fertility Rate", title = "Fertility Rate in Tulare County", subtitle = "2007-2019", color = "County", caption = "Data Source: CDC WONDER Online Database") + ylim(30,100)

#San Joaquin, Ranked #4 in Pesticide Use
MCH.CDC.Data %>% group_by(County) %>% filter(County == "San Joaquin") %>% ggplot(aes(Year, Fertility.Rate, color = County)) + geom_line() + labs( y="Fertility Rate", title = "Fertility Rate in San Joaquin County", subtitle = "2007-2019", color = "County", caption = "Data Source: CDC WONDER Online Database") + ylim(30,100)

#Los Angeles, Comparison Group 
MCH.CDC.Data %>% group_by(County) %>% filter(County == "Los Angeles") %>% ggplot(aes(Year, Fertility.Rate, color = County)) + geom_line() + labs( y="Fertility Rate", title = "Fertility Rate in Los Angeles County", subtitle = "2007-2019", color = "County", caption = "Data Source: CDC WONDER Online Database") + ylim(30,100)

#Grid Plots
p1 <- MCH.CDC.Data %>% group_by(County) %>% filter(County == "Fresno") %>% ggplot(aes(Year, Fertility.Rate, color = County)) + geom_line() + labs( y="Fertility Rate", title = "Fertility Rate in Fresno County", subtitle = "2007-2019", color = "County") + theme(legend.position = "none") + ylim(30,100)

p2 <- MCH.CDC.Data %>% group_by(County) %>% filter(County == "Kern") %>% ggplot(aes(Year, Fertility.Rate, color = County)) + geom_line() + labs( y="Fertility Rate", title = "Fertility Rate in Kern County", subtitle = "2007-2019", color = "County") + theme(legend.position = "none") + ylim(30,100)

p3 <-MCH.CDC.Data %>% group_by(County) %>% filter(County == "Tulare") %>% ggplot(aes(Year, Fertility.Rate, color = County)) + geom_line() + labs( y="Fertility Rate", title = "Fertility Rate in Tulare County", subtitle = "2007-2019", color = "County") + theme(legend.position = "none") + ylim(30,100)

p4 <-MCH.CDC.Data %>% group_by(County) %>% filter(County == "Los Angeles") %>% ggplot(aes(Year, Fertility.Rate, color = County)) + geom_line() + labs( y="Fertility Rate", title = "Fertility Rate in Los Angeles County", subtitle = "2007-2019", color = "County") + theme(legend.position = "none") + ylim(30,100)

grid.arrange(p1, p2, p3, p4, bottom = "Data Source: CDC WONDER Online Database")

Racial Demographics

The next variable I examined was the racial demographics and I did so by stratifying the MCH.CDC.Data by race (via the MCH.CDC.Data_Race data frame) and viewing the total population of each race across all counties in the tile plots to assess if there was any one county with a more dense population of a certain race. I then viewed the racial demographic trends in our counties of interest which include Fresno, Kern, Tulare, San Joaquin (since they are continuously ranked as the top 4 in highest use of pesticides), and Los Angeles (as a control/comparison county) since LA is one of the most populated and most diverse counties in California. Fresno, Kern, Tulare, and San Joaquin counties are also all a part of San Joaquin Valley which we mentioned in our background and related work section to be of special interest to us because it is California’s most productive agricultural region and has one of the highest amounts of pesticide use.

#Racial Demographic GGPLOTS
MCH.CDC.Data_Race %>% filter(Mothers.Race == "American Indian or Alaska Native") %>%
    ggplot(aes(x = Year, y = County,  fill = Total.Population)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("American Indian or Alaska Native Population",
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Racial Demographics by County") + 
    ylab("") + xlab("") 

MCH.CDC.Data_Race %>% filter(Mothers.Race == "Asian or Pacific Islander") %>%
    ggplot(aes(x = Year, y = County,  fill = Total.Population)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Asian or Pacific Islander", 
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Racial Demographics by County") + 
    ylab("") + xlab("")

MCH.CDC.Data_Race %>% filter(Mothers.Race == "Black or African American") %>%
    ggplot(aes(x = Year, y = County,  fill = Total.Population)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Black or African American Population",
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Racial Demographics by County") + 
    ylab("") + xlab("")

MCH.CDC.Data_Race %>% filter(Mothers.Race == "White") %>%
    ggplot(aes(x = Year, y = County,  fill = Total.Population)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("White Population",
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Racial Demographics by County") + 
    ylab("") + xlab("")

#Fresno, Ranked #1 in Pesticide Use
MCH.CDC.Data_Race %>% group_by(County) %>% filter(County == "Fresno") %>% ggplot(aes(Year, Total.Population, color = Mothers.Race)) + geom_line() + labs( y="Total Population (Log 10 Transformation)", title = "Racial Demographics in Fresno County (Transformed Y Axis)", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database") + scale_y_log10()

#Kern, Ranked #2 in Pesticide Use
MCH.CDC.Data_Race %>% group_by(County) %>% filter(County == "Kern") %>%  ggplot(aes(Year, Total.Population, color = Mothers.Race)) + geom_line() + labs( y="Total Population (Log 10 Transformation)", title = "Racial Demographics in Kern County (Transformed Y Axis)", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database") + scale_y_log10()

#Tulare, Ranked #3 in Pesticide Use
MCH.CDC.Data_Race %>% group_by(County) %>% filter(County == "Tulare") %>%  ggplot(aes(Year, Total.Population, color = Mothers.Race)) + geom_line() +  labs( y="Total Population (Log 10 Transformation)", title = "Racial Demographics in Tulare County (Transformed Y Axis)", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database") + scale_y_log10()

#San Joaquin, Ranked #4 in Pesticide Use
MCH.CDC.Data_Race %>% group_by(County) %>% filter(County == "San Joaquin") %>%  ggplot(aes(Year, Total.Population, color = Mothers.Race)) + geom_line()  + labs( y="Total Population (Log 10 Transformation)", title = "Racial Demographics in San Joaquin County (Transformed Y Axis)", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database") + scale_y_log10()

#Los Angeles, Comparison Group
MCH.CDC.Data_Race %>% group_by(County) %>% filter(County == "Los Angeles") %>%  ggplot(aes(Year, Total.Population, color = Mothers.Race)) + geom_line() + labs( y="Total Population (Log 10 Transformation)", title = "Racial Demographics in Los Angeles County (Transformed Y Axis)", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database") + scale_y_log10()

##### Fertility Rate (Stratified By Race) The next variable I examined was once again fertility rate but this time stratified by race. I first visualized the fertility rates across all counties over the years by each race in a tile plot. I then viewed the trends of fertility rates by race in our counties of interest which include Fresno, Kern, Tulare, San Joaquin (since they are continuously ranked as the top 4 in highest use of pesticides), and Los Angeles (as a control/comparison county) since LA is one of the most populated and most diverse counties in California. Fresno, Kern, Tulare, and San Joaquin counties are also all a part of San Joaquin Valley which we mentioned in our background and related work section to be of special interest to us because it is California’s most productive agricultural region and has one of the highest amounts of pesticide use.

#Race and Fertility GGPLOTS
MCH.CDC.Data_Race %>% filter(Mothers.Race == "American Indian or Alaska Native") %>%
ggplot(aes(x = Year, y = County,  fill = Fertility.Rate)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Fertility Rate", limits = c(0,115),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Fertility Rate by County for American Indian or Alaska Native Pop.") + 
    ylab("") + xlab("") 

MCH.CDC.Data_Race %>% filter(Mothers.Race == "Asian or Pacific Islander") %>%
ggplot(aes(x = Year, y = County,  fill = Fertility.Rate)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Fertility Rate",limits = c(0,115),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Fertility Rate by County for Asian or Pacific Islander Pop.") + 
    ylab("") + xlab("") 

MCH.CDC.Data_Race %>% filter(Mothers.Race == "Black or African American") %>%
ggplot(aes(x = Year, y = County,  fill = Fertility.Rate)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Fertility Rate", limits = c(0,115),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Fertility Rate by County for Black or African American Pop.") + 
    ylab("") + xlab("") 

MCH.CDC.Data_Race %>% filter(Mothers.Race == "White") %>%
ggplot(aes(x = Year, y = County,  fill = Fertility.Rate)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Fertility Rate", limits = c(0,115),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Fertility Rate by County for White Pop.") + 
    ylab("") + xlab("") 

#Fresno, Ranked #1 in Pesticide Use
MCH.CDC.Data_Race %>% filter(County == "Fresno") %>% ggplot(aes(Year, Fertility.Rate, color = Mothers.Race)) + geom_line() + labs( y="Fertility Rate", title = "Race and Fertility Data in Fresno County", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database") + ylim(0,115)

#Kern, Ranked #2 in Pesticide Use
MCH.CDC.Data_Race %>% filter(County == "Kern") %>% ggplot(aes(Year, Fertility.Rate, color = Mothers.Race)) + geom_line() + labs( y="Fertility Rate", title = "Race and Fertility Data in Kern County", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database") + ylim(0,115)

#Tulare, Ranked #3 in Pesticide Use
MCH.CDC.Data_Race %>% filter(County == "Tulare") %>% ggplot(aes(Year, Fertility.Rate, color = Mothers.Race)) + geom_line() + labs( y="Fertility Rate", title = "Race and Fertility Data in Tulare County", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database") + ylim(0,115)

#San Joaquin, Ranked #4 in Pesticide Use
MCH.CDC.Data_Race %>% filter(County == "San Joaquin") %>% ggplot(aes(Year, Fertility.Rate, color = Mothers.Race)) + geom_line() + labs( y="Fertility Rate", title = "Race and Fertility Data in San Joaquin County", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database") + ylim(0,115)

#Los Angeles, Comparison 
MCH.CDC.Data_Race %>% filter(County == "Los Angeles") %>% ggplot(aes(Year, Fertility.Rate, color = Mothers.Race)) + geom_line() + labs( y="Fertility Rate", title = "Race and Fertility Data in Los Angeles County", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database") + ylim(0,115)

Preterm Birth (Stratified By Race)

The next variable I examined was preterm birth measured as LMP (last menstrual period) gestation age in weeks, and I added stratification by race. I first visualized the preterm birth across all counties over the years and then stratified the tile plots by race. I then viewed the trends of gestational age by race in our counties of interest which include Fresno, Kern, Tulare, San Joaquin (since they are continuously ranked as the top 4 in highest use of pesticides), and Los Angeles (as a control/comparison county) since LA is one of the most populated and most diverse counties in California. Fresno, Kern, Tulare, and San Joaquin counties are also all a part of San Joaquin Valley which we mentioned in our background and related work section to be of special interest to us because it is California’s most productive agricultural region and has one of the highest amounts of pesticide use. I also included a horizontal line for the county level stratified data at 37 weeks which is the cutoff for defining preterm birth.

#Preterm Birth GGPLOTS (With Race)
MCH.CDC.Data %>% 
    ggplot(aes(x = Year, y = County,  fill = Average.LMP.Gestational.Age)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Average Gestational Age (LMP) in Weeks", limits = c(36,40),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Average Gestational Age by County") + 
    ylab("") + xlab("") 

MCH.CDC.Data_Race %>% filter(Mothers.Race == "American Indian or Alaska Native") %>%
ggplot(aes(x = Year, y = County,  fill = Average.LMP.Gestational.Age)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Average Gestational Age (LMP) in Weeks", limits = c(36,40),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Average Gestational Age by County for American Indian or Alaska Native Pop.") + 
    ylab("") + xlab("") 

MCH.CDC.Data_Race %>% filter(Mothers.Race == "Asian or Pacific Islander") %>%
ggplot(aes(x = Year, y = County,  fill = Average.LMP.Gestational.Age)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Average Gestational Age (LMP) in Weeks", limits = c(36,40),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Average Gestational Age by County for Asian or Pacific Islander Pop.") + 
    ylab("") + xlab("") 

MCH.CDC.Data_Race %>% filter(Mothers.Race == "Black or African American") %>%
ggplot(aes(x = Year, y = County,  fill = Average.LMP.Gestational.Age)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Average Gestational Age (LMP) in Weeks", limits = c(36,40),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Average Gestational Age by County for Black or African American Pop.") + 
    ylab("") + xlab("") 

MCH.CDC.Data_Race %>% filter(Mothers.Race == "White") %>%
ggplot(aes(x = Year, y = County,  fill = Average.LMP.Gestational.Age)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Average Gestational Age (LMP) in Weeks", limits = c(36,40),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Average Gestational Age by County for White Pop.") + 
    ylab("") + xlab("") 

#Fresno, Ranked #1 in Pesticide Use
MCH.CDC.Data_Race %>% filter(County == "Fresno") %>% ggplot(aes(Year, Average.LMP.Gestational.Age, color = Mothers.Race)) + geom_line() + labs( y="Average LMP Gestational Age (Weeks)", title = "Preterm Birth in Fresno County", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database") + geom_hline(yintercept = 37, size =2) + ylim(36,40)

#Kern, Ranked #2 in Pesticide Use
MCH.CDC.Data_Race %>% filter(County == "Kern") %>% ggplot(aes(Year, Average.LMP.Gestational.Age, color = Mothers.Race)) + geom_line() + labs( y="Average LMP Gestational Age (Weeks)", title = "Preterm Birth in Kern County", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database")  + geom_hline(yintercept = 37, size =2) + ylim(36,40)

#Tulare, Ranked #3 in Pesticide Use
MCH.CDC.Data_Race %>% filter(County == "Tulare") %>% ggplot(aes(Year, Average.LMP.Gestational.Age, color = Mothers.Race)) + geom_line() + labs( y="Average LMP Gestational Age (Weeks)", title = "Preterm Birth in Tulare", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database")  + geom_hline(yintercept = 37, size =2) + ylim(36,40)

#San Joaquin, Ranked #4 in Pesticide Use
MCH.CDC.Data_Race %>% filter(County == "San Joaquin") %>% ggplot(aes(Year, Average.LMP.Gestational.Age, color = Mothers.Race)) + geom_line() + labs( y="Average LMP Gestational Age (Weeks)", title = "Preterm Birth in San Joaquin County", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database") + geom_hline(yintercept = 37, size =2) + ylim(36,40)

#Los Angeles County, Comparison Group
MCH.CDC.Data_Race %>% filter(County == "Los Angeles") %>% ggplot(aes(Year, Average.LMP.Gestational.Age, color = Mothers.Race)) + geom_line() + labs( y="Average LMP Gestational Age (Weeks)", title = "Preterm Birth in Los Angeles County", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database")  + geom_hline(yintercept = 37, size =2) + ylim(36,40)

#Preterm Birth Cutoff is 37 Weeks (Horizontal Line)
Birth Weight (Stratified By Race)

The next variable I examined was birth weight measured in grams, and I added stratification by race. I first visualized the birth weight across all counties over the years and then stratified the tile plots by race. I then viewed the trends of birth weight by race in our counties of interest which include Fresno, Kern, Tulare, San Joaquin (since they are continuously ranked as the top 4 in highest use of pesticides), and Los Angeles (as a control/comparison county) since LA is one of the most populated and most diverse counties in California. Fresno, Kern, Tulare, and San Joaquin counties are also all a part of San Joaquin Valley which we mentioned in our background and related work section to be of special interest to us because it is California’s most productive agricultural region and has one of the highest amounts of pesticide use. I also included a horizontal line for the county level stratified data at 25000 grams which is the cutoff for defining low birth weight.

#Birth weight GGPLOTS (With Race)
MCH.CDC.Data %>% 
    ggplot(aes(x = Year, y = County,  fill = Average.Birth.Weight)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Average Birth Weight in Grams", limits = c(2400, 3600),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Average Birth Weight by County") + 
    ylab("") + xlab("") 

MCH.CDC.Data_Race %>% filter(Mothers.Race == "American Indian or Alaska Native") %>%
ggplot(aes(x = Year, y = County,  fill = Average.Birth.Weight)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Average Birth Weight in Grams", limits = c(2400, 3600),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Average Birth Weight by County for American Indian or Alaska Native Pop.") + 
    ylab("") + xlab("") 

MCH.CDC.Data_Race %>% filter(Mothers.Race == "Asian or Pacific Islander") %>%
ggplot(aes(x = Year, y = County,  fill = Average.Birth.Weight)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Average Birth Weight in Grams", limits = c(2400, 3600),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Average Birth Weight by County for Asian or Pacific Islander Pop.") + 
    ylab("") + xlab("") 

MCH.CDC.Data_Race %>% filter(Mothers.Race == "Black or African American") %>%
ggplot(aes(x = Year, y = County,  fill = Average.Birth.Weight)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Average Birth Weight Grams", limits = c(2400, 3600),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Average Birth Weight by County for Black or African American Pop.") + 
    ylab("") + xlab("") 

MCH.CDC.Data_Race %>% filter(Mothers.Race == "White") %>%
ggplot(aes(x = Year, y = County,  fill = Average.Birth.Weight)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Average Birth Weight in Grams", limits = c(2400, 3600),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Average Birth Weight by County for White Pop.") + 
    ylab("") + xlab("") 

#Fresno, Ranked #1 in Pesticide Use
MCH.CDC.Data_Race %>% filter(County == "Fresno") %>% ggplot(aes(Year, Average.Birth.Weight, color = Mothers.Race)) + geom_line() + labs( y="Average Birth Weight (Grams)", title = "Birth Weight Data in Fresno County", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database") + geom_hline(yintercept = 2500, size =2)+ ylim(2400, 3600)

#Kern, Ranked #2 in Pesticide Use
MCH.CDC.Data_Race %>% filter(County == "Kern") %>% ggplot(aes(Year, Average.Birth.Weight, color = Mothers.Race)) + geom_line() + labs( y="Average Birth Weight (Grams)", title = "Birth Weight Data in Kern County", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database")  + geom_hline(yintercept = 2500, size =2)+ ylim(2400, 3600)

#Tulare, Ranked #3 in Pesticide Use
MCH.CDC.Data_Race %>% filter(County == "Tulare") %>% ggplot(aes(Year, Average.Birth.Weight, color = Mothers.Race)) + geom_line() + labs( y="Average Birth Weight (Grams)", title = "Birth Weight Data in Tulare", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database")  + geom_hline(yintercept = 2500, size =2)+ ylim(2400, 3600)

#San Joaquin, Ranked #4 in Pesticide Use
MCH.CDC.Data_Race %>% filter(County == "San Joaquin") %>% ggplot(aes(Year, Average.Birth.Weight, color = Mothers.Race)) + geom_line() + labs( y="Average Birth Weight (Grams)", title = "Birth Weight Data in San Joaquin County", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database") + geom_hline(yintercept = 2500, size =2)+ ylim(2400, 3600)

#Los Angeles, Comparison Group
MCH.CDC.Data_Race %>% filter(County == "Los Angeles") %>% ggplot(aes(Year, Average.Birth.Weight, color = Mothers.Race)) + geom_line() + labs( y="Average Birth Weight (Grams)", title = "Birth Weight Data in Los Angeles County", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database")  + geom_hline(yintercept = 2500, size =2) + ylim(2400, 3600)

#LBW Cutoff is 2500 Grams (Horizontal Line)

Pesticide Data Wrangling

The CDPR creates a Summary of Pesticide Use Data every year that details pesticide usage throughout the state. Data from the 2016 report onward were available directly from a database linked in the report. For data prior to 2016, the one would need to navigate multiple large zip files before being able to find what they need. After many fruitless attempts to navigate the file system linked in the documents, I decided to get the desired table (Table 1, which shows the total pounds of pesticide active ingredients reported in each county and that county’s rank in the current year and the previous year) directly from the documents using the pdftools package and string processing. I read in the pages from each pdf and wrote the tables to csv file (“table1_YEAR.csv”) using the code in import_data2.R. These csv files were then combined to create a single table for pesticide usage from 2007 to 2016. I also imported a table of all the pesticides determined to cause reproductive harm according to Proposition 65. We were not able to find a use for that information.

county_ranks16 <- read_delim("table1_county_rank_2016.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
repro_lbs16 <- read_delim("table3_reproductive_lbs_2016.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
repro_acre16 <- read_delim("table4_reproductive_acres_2016.txt", "\t", escape_double = FALSE, trim_ws = TRUE)


table1_2016 <- county_ranks16 %>% transmute(county = COUNTY, 
                                     lbs_2015 = LBS_2015, rank_2015 = RANK_2015, 
                                     lbs_2016 = LBS_2016, rank_2016 = RANK_2016)
# column 1 is the county
# columns 2-3 have the previous year data
# columns 4-5 have the current year data

# we only want columns 1-3 for the most up-to-date data for all years before 2016
all_dat <- list(read_csv("table1_2007.csv")[1:3],
                read_csv("table1_2008.csv")[1:3],
                read_csv("table1_2009.csv")[1:3],
                read_csv("table1_2010.csv")[1:3],
                read_csv("table1_2011.csv")[1:3],
                read_csv("table1_2012.csv")[1:3],
                read_csv("table1_2013.csv")[1:3],
                read_csv("table1_2014.csv")[1:3],
                read_csv("table1_2015.csv")[1:3])


table1 <- Reduce(function(x, y) left_join(x, y, by = "county"), all_dat)


long_table1 <- table1 %>% pivot_longer(!county, names_to = 'usage', values_to = "value")
#table1_ranks <- long_table1 %>% filter(str_starts(usage, "rank"))
table1_lbs <- long_table1 %>% filter(str_starts(usage, "lbs"))

long_table2 <- table1_2016 %>% pivot_longer(!county, names_to = 'usage', values_to = "value")
#table1_ranks_1516 <- long_table2 %>% filter(str_starts(usage, "rank"))
table1_lbs_1516<- long_table2 %>% filter(str_starts(usage, "lbs"))

table1_lbs$usage <- as.numeric(gsub("[^[:digit:]]+", "", table1_lbs$usage))
table1_lbs_1516$usage <- as.numeric(gsub("[^[:digit:]]+", "", table1_lbs_1516$usage))
combined_pesticide_use <- table1_lbs %>% full_join(table1_lbs_1516) 
class(combined_pesticide_use$usage) 
## [1] "numeric"
combined_pesticide_use <- combined_pesticide_use %>% group_by(usage) 
combined_pesticide_use <- combined_pesticide_use %>% arrange(usage)

LBW Data from CDC (Wrangling for use in ShinyApp):

This is my data wrangling process for low birth weight for the CDC WONDER database. By default, CDC WONDER live birth database only displayed counties that had a county population >100,000. I only looked at low birth rate here and this is for my shiny app bar graph.

#MCH_CDC Data for low birth weight 
#data wrangling mch cdc data
cdc_lowbirthweight <- read.delim("MCH CDC Data.txt",  sep ="\t", dec=".", header = TRUE, stringsAsFactors = FALSE)
cdc_lowbirthweight  <- cdc_lowbirthweight [-c(482:538), ]
cdc_lowbirthweight  <- cdc_lowbirthweight [ ,-c(1, 3, 5, 7)]
MCH.CDC.Data.Total <- read.delim("MCH CDC Data Total.txt",  sep ="\t", dec=".", header = TRUE, stringsAsFactors = FALSE)
MCH.CDC.Data.Total <- MCH.CDC.Data.Total[,-c(1, 3, 5)]

#I noticed that the data I downloaded did not include total # of births so merging two datasets (one that has total # of birth counts and the other with low birth weight +very low birth weight counts)
df1 <- full_join(cdc_lowbirthweight , MCH.CDC.Data.Total, by=c("Year", "County"))
df1<- df1 %>% rename("cases" = "Births.x", "total_births" = "Births.y")

#Note: LBW = Low birth weight + Very low birth weight counts; Total Births = Total # of Birth
col_order <- c("Year", "County", "total_births",
               "cases", "Average.Birth.Weight", "Standard.Deviation.for.Average.Birth.Weight",
               "Average.Age.of.Mother", "Standard.Deviation.for.Average.Age.of.Mother","Average.LMP.Gestational.Age",
               "Standard.Deviation.for.Average.LMP.Gestational.Age")
df2 <- df1[,col_order]
df2[df2$County == "Alameda County, CA", "County"] <-"alameda"
df2[df2$County == "Butte County, CA", "County"] <-"butte"
df2[df2$County == "Contra Costa County, CA", "County"] <-"contra costa"
df2[df2$County == "El Dorado County, CA", "County"] <-"el dorado"
df2[df2$County == "Fresno County, CA", "County"] <-"fresno"
df2[df2$County == "Humboldt County, CA", "County"] <-"humboldt"
df2[df2$County == "Imperial County, CA", "County"] <-"imperial"
df2[df2$County == "Kern County, CA", "County"] <-"kern"
df2[df2$County == "Kings County, CA", "County"] <-"kings"
df2[df2$County == "Los Angeles County, CA", "County"] <-"los angeles"
df2[df2$County == "Madera County, CA", "County"] <-"madera"
df2[df2$County == "Marin County, CA", "County"] <-"marin"
df2[df2$County == "Contra Costa County, CA", "County"] <-"mariposa"
df2[df2$County == "Merced County, CA", "County"] <-"merced"
df2[df2$County == "Monterey County, CA", "County"] <-"monterey"
df2[df2$County == "Napa County, CA", "County"] <-"napa"
df2[df2$County == "Orange County, CA", "County"] <-"orange"
df2[df2$County == "Placer County, CA", "County"] <-"placer"
df2[df2$County == "Riverside County, CA", "County"] <-"riverside"
df2[df2$County == "Sacramento County, CA", "County"] <-"sacramento"
df2[df2$County == "San Bernardino County, CA", "County"] <-"san bernardino"
df2[df2$County == "San Diego County, CA", "County"] <-"san diego"
df2[df2$County == "San Francisco County, CA", "County"] <-"san francisco"
df2[df2$County == "San Joaquin County, CA", "County"] <-"san joaquin"
df2[df2$County == "San Luis Obispo County, CA", "County"] <-"san luis obispo"
df2[df2$County == "San Mateo County, CA", "County"] <-"san mateo"
df2[df2$County == "Santa Barbara County, CA", "County"] <-"santa barbara"
df2[df2$County == "Santa Clara County, CA", "County"] <-"santa clara"
df2[df2$County == "Santa Cruz County, CA", "County"] <-"santa cruz"
df2[df2$County == "Shasta County, CA", "County"] <-"shasta"
df2[df2$County == "Solano County, CA", "County"] <-"solano"
df2[df2$County == "Sonoma County, CA", "County"] <-"sonoma"
df2[df2$County == "Stanislaus County, CA", "County"] <-"stanislaus"
df2[df2$County == "Tulare County, CA", "County"] <-"tulare"
df2[df2$County == "Ventura County, CA", "County"] <-"ventura"
df2[df2$County == "Yolo County, CA", "County"] <-"yolo"
df2 <- df2 %>% filter(!is.na(total_births)) %>% filter(!is.na(cases)) %>% mutate(rate = cases/total_births * 10^2)
df2$County <- df2$County %>% str_to_title()
Preterm Birth Data from CDC Wonder Database (Wrangling for use in ShinyApp)::

This is my data wrangling process for preterm birth for the CDC WONDER database. By default, CDC WONDER live birth database only displayed counties that had a county population >100,000. I only looked at preterm birth here and this is for my shiny app bar graph.

cdc_pretermbirth <- read.delim("Preterm birth.txt",  sep ="\t", dec=".", header = TRUE, stringsAsFactors = FALSE)
cdc_pretermbirth  <- cdc_pretermbirth [-c(422:472), ]
cdc_pretermbirth  <- cdc_pretermbirth [ ,-c(1, 3, 5)]
cdc_pretermbirth <- cdc_pretermbirth %>% rename("Events" = "Births")
MCH.CDC.Data.Total <- read.delim("MCH CDC Data Total.txt",  sep ="\t", dec=".", header = TRUE, stringsAsFactors = FALSE)
MCH.CDC.Data.Total <- MCH.CDC.Data.Total[,-c(1, 3, 5)]
MCH.CDC.Data.Total <- MCH.CDC.Data.Total %>% rename("total_birth" = "Births")

df1_pt <- full_join(cdc_pretermbirth , MCH.CDC.Data.Total, by=c("Year", "County"))

df1_pt[df1_pt$County == "Alameda County, CA", "County"] <-"alameda"
df1_pt[df1_pt$County == "Butte County, CA", "County"] <-"butte"
df1_pt[df1_pt$County == "Contra Costa County, CA", "County"] <-"contra costa"
df1_pt[df1_pt$County == "El Dorado County, CA", "County"] <-"el dorado"
df1_pt[df1_pt$County == "Fresno County, CA", "County"] <-"fresno"
df1_pt[df1_pt$County == "Humboldt County, CA", "County"] <-"humboldt"
df1_pt[df1_pt$County == "Imperial County, CA", "County"] <-"imperial"
df1_pt[df1_pt$County == "Kern County, CA", "County"] <-"kern"
df1_pt[df1_pt$County == "Kings County, CA", "County"] <-"kings"
df1_pt[df1_pt$County == "Los Angeles County, CA", "County"] <-"los angeles"
df1_pt[df1_pt$County == "Madera County, CA", "County"] <-"madera"
df1_pt[df1_pt$County == "Marin County, CA", "County"] <-"marin"
df1_pt[df1_pt$County == "Contra Costa County, CA", "County"] <-"mariposa"
df1_pt[df1_pt$County == "Merced County, CA", "County"] <-"merced"
df1_pt[df1_pt$County == "Monterey County, CA", "County"] <-"monterey"
df1_pt[df1_pt$County == "Napa County, CA", "County"] <-"napa"
df1_pt[df1_pt$County == "Orange County, CA", "County"] <-"orange"
df1_pt[df1_pt$County == "Placer County, CA", "County"] <-"placer"
df1_pt[df1_pt$County == "Riverside County, CA", "County"] <-"riverside"
df1_pt[df1_pt$County == "Sacramento County, CA", "County"] <-"sacramento"
df1_pt[df1_pt$County == "San Bernardino County, CA", "County"] <-"san bernardino"
df1_pt[df1_pt$County == "San Diego County, CA", "County"] <-"san diego"
df1_pt[df1_pt$County == "San Francisco County, CA", "County"] <-"san francisco"
df1_pt[df1_pt$County == "San Joaquin County, CA", "County"] <-"san joaquin"
df1_pt[df1_pt$County == "San Luis Obispo County, CA", "County"] <-"san luis obispo"
df1_pt[df1_pt$County == "San Mateo County, CA", "County"] <-"san mateo"
df1_pt[df1_pt$County == "Santa Barbara County, CA", "County"] <-"santa barbara"
df1_pt[df1_pt$County == "Santa Clara County, CA", "County"] <-"santa clara"
df1_pt[df1_pt$County == "Santa Cruz County, CA", "County"] <-"santa cruz"
df1_pt[df1_pt$County == "Shasta County, CA", "County"] <-"shasta"
df1_pt[df1_pt$County == "Solano County, CA", "County"] <-"solano"
df1_pt[df1_pt$County == "Sonoma County, CA", "County"] <-"sonoma"
df1_pt[df1_pt$County == "Stanislaus County, CA", "County"] <-"stanislaus"
df1_pt[df1_pt$County == "Tulare County, CA", "County"] <-"tulare"
df1_pt[df1_pt$County == "Ventura County, CA", "County"] <-"ventura"
df1_pt[df1_pt$County == "Yolo County, CA", "County"] <-"yolo"
df1_pt <- df1_pt %>% mutate(County = str_to_title(County))
df1_pt <- df1_pt %>% filter(!is.na("total_birth")) %>% filter(!is.na(Events)) %>% mutate(rate = Events/total_birth * 10^2)

Data Joining of CDC WONDER Live Birth Data (Low Birth Weight and Pre-Term Birth) and Pesticide Data

I then joined the CDC WONDER data (low birth weight and preterm birth) and Zainab’s wrangled pesticide data to come up with a joint data. I then generated bar graphs to visualize the trend across a span of 2007-2016 (Please see shiny app). We noticed that Fresno and Kern county were the two top counties that used the highest amounts of pesticide and found out that San Joaquin Valley is a region that’s agriculturally productive.

county_ranks16 <- read_delim("table1_county_rank_2016.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
repro_lbs16 <- read_delim("table3_reproductive_lbs_2016.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
repro_acre16 <- read_delim("table4_reproductive_acres_2016.txt", "\t", escape_double = FALSE, trim_ws = TRUE)

table1_2016 <- county_ranks16 %>% transmute(county = COUNTY, 
                                            lbs_2015 = LBS_2015, rank_2015 = RANK_2015, 
                                            lbs_2016 = LBS_2016, rank_2016 = RANK_2016)

all_dat <- list(read_csv("table1_2007.csv")[1:3],
                read_csv("table1_2008.csv")[1:3],
                read_csv("table1_2009.csv")[1:3],
                read_csv("table1_2010.csv")[1:3],
                read_csv("table1_2011.csv")[1:3],
                read_csv("table1_2012.csv")[1:3],
                read_csv("table1_2013.csv")[1:3],
                read_csv("table1_2014.csv")[1:3],
                read_csv("table1_2015.csv")[1:3])


table1 <- Reduce(function(x, y) left_join(x, y, by = "county"), all_dat)

long_table1 <- table1 %>% pivot_longer(!county, names_to = 'usage', values_to = "value")
table1_ranks <- long_table1 %>% filter(str_starts(usage, "rank"))
table1_lbs <- long_table1 %>% filter(str_starts(usage, "lbs"))
table1_lbs$usage <- as.numeric(gsub("[^[:digit:]]+", "", table1_lbs$usage))

long_table2 <- table1_2016 %>% pivot_longer(!county, names_to = 'usage', values_to = "value")
table1_ranks_1516 <- long_table2 %>% filter(str_starts(usage, "rank"))
table1_lbs_1516<- long_table2 %>% filter(str_starts(usage, "lbs"))


table1_lbs_1516$usage <- as.numeric(gsub("[^[:digit:]]+", "", table1_lbs_1516$usage))
combined_pesticide_use <- table1_lbs %>% full_join(table1_lbs_1516) 
combined_pesticide_use <- combined_pesticide_use %>% group_by(usage) 
combined_pesticide_use <- combined_pesticide_use %>% arrange(usage)

averagebw <-df2 %>% select("County", "Year", "rate")
pesticide_averagebw_join <- averagebw %>% inner_join(combined_pesticide_use, by = c("County" = "county", "Year" = "usage")) 

averagept <-df1_pt %>% select("County", "Year", "rate")
pesticide_averagept_join <- averagept %>% inner_join(combined_pesticide_use, by = c("County" = "county", "Year" = "usage")) 

#bar graph of low birth weight
pesticide_averagebw_join %>% ggplot(aes(County, rate)) + geom_col() + ylab("Low Birth Weight Rate (%)") +xlab("") +
                theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 1/2)) 

#bar graph of pesticide 
pesticide_averagept_join %>% ggplot(aes(County, value)) + geom_col() + ylab("Pesticide Use (Pounds)") +xlab("") + 
            theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 1/2))

Code for Creating the Map - Leaflet Map Using Preterm Birth Data I Wrangled Earlier (See Shiny App for the Final Result)

This is a similar spatial map but for pesticide use. Following is the wrangling process of generating a map that shapes the map of California, and merging that spatial object with pesticide use that I wrangled earlier to generate a leaflet map.

averagept_df <- combined_pesticide_use %>% filter(usage == "2016")

map <- readOGR(path.expand("cb_2018_us_county_20m.shp"),
               layer = "cb_2018_us_county_20m", stringsAsFactors = FALSE)
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/lararostomian/Desktop/Harvard/Classes/BST 260/datascience-project/Data Prep (& Final RMD)/cb_2018_us_county_20m.shp", layer: "cb_2018_us_county_20m"
## with 3220 features
## It has 9 fields
## Integer64 fields read as strings:  ALAND AWATER
Statekey<-read.csv('./STATEFPtoSTATENAME_Key.csv', colClasses=c('character'))
map<-merge(x=map, y=Statekey, by="STATEFP", all=TRUE)
SingleState <- subset(map, map$STATENAME %in% c(
    "California"
))

spatial_pesticide <-sp::merge(x=SingleState, y=averagept_df, by.x="NAME", by.y="county", by=x)

binpes <- c(200, 100145, 1131454, 3345277, Inf)
pal3 <- colorBin(
    palette = "magma",
    domain = spatial_pesticide$value, n=7, bins=binpes)

leaflet(spatial_pesticide, options = leafletOptions(zoomControl = TRUE, zoomLevelFixed = FALSE, dragging=TRUE, minZoom = 5.3, maxZoom = 9)) %>% 
                setView(lat = 36.778259, lng = -119.417931, zoom = 6) %>%
                addPolygons(color = "Black", weight = 1, smoothFactor = 0.5, 
                            opacity = 1.0, fillOpacity = 0.5, layerId = ~NAME,
                            fillColor = ~pal3(value), 
                            popup = ~as.factor(paste0("<b><font size=\"4\"><center>County: </b>",spatial_pesticide$NAME,"</font></center>","Amounts of Pesticides used </b>", sprintf("%1.2f", spatial_pesticide$value),"<br/>"))) %>%
                addLegend(pal = pal3, values = spatial_pesticide$value, opacity = 1, title="Amounts of Pesticide Used (Pounds)")

Regression Analysis

#Top 10 Counties in term of pesticide usage, according to CFDA
agro <- c("Kern", "Tulare", "Fresno", "Monterey", "Merced", "Stanislaus", 
          "San Joaquin", "Ventura", "Imperial", "Kings")

mch_regression <- MCH.CDC.Data_Race %>% 
  filter(Year == 2016) %>%
  mutate(agricultural = ifelse(County %in% agro, 1, 0))

First we fit a model, stratified by agricultural ranking.

#best model in terms of p-values and adjusted R-squared, contains and interaction term between gestational age and top ten status
linmod <- lm(Average.Birth.Weight ~ Average.LMP.Gestational.Age + factor(agricultural) + Average.LMP.Gestational.Age * factor(agricultural), mch_regression)
summary(linmod)[4]
## $coefficients
##                                                     Estimate Std. Error
## (Intercept)                                       -6880.0613  995.26975
## Average.LMP.Gestational.Age                         262.4304   25.70411
## factor(agricultural)1                              5256.6615 1888.23917
## Average.LMP.Gestational.Age:factor(agricultural)1  -136.4639   48.80666
##                                                     t value     Pr(>|t|)
## (Intercept)                                       -6.912760 1.795659e-10
## Average.LMP.Gestational.Age                       10.209666 1.996331e-18
## factor(agricultural)1                              2.783896 6.154339e-03
## Average.LMP.Gestational.Age:factor(agricultural)1 -2.796011 5.940801e-03
summary(linmod)[9]
## $adj.r.squared
## [1] 0.4615328
#On average, babies born to mothers living in a top ten agricultural county have an average birth weight less than those born to mothers that do not live in a top ten agricultural county, for a given LMP gestational age. As average LMP gestational age increases, average birth weight also increases.
mch_regression %>%
  ggplot(aes(Average.LMP.Gestational.Age, Average.Birth.Weight, color = factor(agricultural))) + 
  geom_point() + 
  geom_line(aes(y = predict(linmod)))  + 
  xlab("Average LMP Gestational Age (weeks)") + 
  ylab("Average Birth Weight (grams)") + 
  scale_color_discrete(name = "Top Ten\nAgricultural\nCounty", labels = c('No', "Yes")) +
  ggtitle("Birth Weight Outcomes by Agriculture Category") + 
  ylim(2980, 3520) +
  xlim(37.6, 39.6)

Next we fit a models stratified by mother’s race.

#Black and Asian/Pacific Island populations tend to have lower average birth weights
mch_regression %>%
  ggplot(aes(Average.LMP.Gestational.Age, Average.Birth.Weight, color = Mothers.Race)) + 
  geom_point() +  
  xlab("Average LMP Gestational Age (weeks)") + 
  ylab("Average Birth Weight (grams)") + 
  scale_color_discrete(name = "Mother's Race") +
  ggtitle("Birth Weight Outcomes by Race") + 
  ylim(2980, 3520) +
  xlim(37.6, 39.6)

#simple linear model more parsimonious than the one that has the interaction term for American Indian/Alaska Native Mothers
#tho there are low populations for this group, so the numbers in the data may have a lot of variability among different years
amerindian_mod <- lm(Average.Birth.Weight ~ Average.LMP.Gestational.Age, filter(mch_regression, Mothers.Race == "American Indian or Alaska Native"))
summary(amerindian_mod)[4] #coefficients
## $coefficients
##                               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)                 -3829.7135 1246.73649 -3.071791 4.495686e-03
## Average.LMP.Gestational.Age   184.9774   32.20391  5.743941 2.857815e-06
summary(amerindian_mod)[9] #adjusted r-squared
## $adj.r.squared
## [1] 0.5078807
q1 <- mch_regression %>%
  filter(Mothers.Race == "American Indian or Alaska Native") %>%
  ggplot(aes(Average.LMP.Gestational.Age, Average.Birth.Weight, color = factor(agricultural))) + 
  geom_point() + geom_line(aes(y = predict(amerindian_mod)), size = 1) + 
  xlab("Average LMP Gestational Age (weeks)") + 
  ylab("Average Birth Weight (grams)") + 
  scale_color_discrete(name = "Top Ten\nAgricultural\nCounty", labels = c('No', "Yes")) +
  ggtitle("Birth Weight Outcome for American Indian and Alaska Native Mothers") + 
  ylim(2980, 3520) +
  xlim(37.6, 39.6)
#even simple linear regression doesn't explain a lot of the errors for Asian mothers
asian_mod <- lm(Average.Birth.Weight ~ Average.LMP.Gestational.Age, filter(mch_regression, Mothers.Race == "Asian or Pacific Islander"))
summary(asian_mod)[4] #coefficients
## $coefficients
##                               Estimate Std. Error    t value   Pr(>|t|)
## (Intercept)                 -429.60128 1658.17961 -0.2590801 0.79718280
## Average.LMP.Gestational.Age   94.23017   42.87789  2.1976402 0.03509986
summary(asian_mod)[9] #adjusted r-squared
## $adj.r.squared
## [1] 0.1012334
q2 <- mch_regression %>%
  filter(Mothers.Race == "Asian or Pacific Islander") %>%
  ggplot(aes(Average.LMP.Gestational.Age, Average.Birth.Weight, color = factor(agricultural))) + 
  geom_point() + 
  geom_line(aes(y = predict(asian_mod)), size = 1) + 
  xlab("Average LMP Gestational Age (weeks)") + 
  ylab("Average Birth Weight (grams)") + 
  scale_color_discrete(name = "Top Ten\nAgricultural\nCounty", labels = c('No', "Yes")) +
  ggtitle("Birth Weight Outcome for Asian and Pacific Islander Mothers") + 
  ylim(2980, 3520) +
  xlim(37.6, 39.6)
black_mod <- lm(Average.Birth.Weight ~ Average.LMP.Gestational.Age + factor(agricultural) + Average.LMP.Gestational.Age*factor(agricultural), filter(mch_regression, Mothers.Race == "Black or African American"))
summary(black_mod)[4] #coefficients
## $coefficients
##                                                     Estimate Std. Error
## (Intercept)                                       -5672.4322 1361.94810
## Average.LMP.Gestational.Age                         230.1197   35.26328
## factor(agricultural)1                              5451.8961 2461.08502
## Average.LMP.Gestational.Age:factor(agricultural)1  -142.4039   63.79834
##                                                     t value     Pr(>|t|)
## (Intercept)                                       -4.164940 2.304953e-04
## Average.LMP.Gestational.Age                        6.525760 2.774693e-07
## factor(agricultural)1                              2.215241 3.422210e-02
## Average.LMP.Gestational.Age:factor(agricultural)1 -2.232094 3.297249e-02
summary(black_mod)[9] #adjusted r-squared
## $adj.r.squared
## [1] 0.585938
q3 <- mch_regression %>%
  filter(Mothers.Race == "Black or African American") %>%
  ggplot(aes(Average.LMP.Gestational.Age, Average.Birth.Weight, color = factor(agricultural))) + 
  geom_point() + geom_line(aes(y = predict(black_mod)), size = 1) + 
  xlab("Average LMP Gestational Age (weeks)") + 
  ylab("Average Birth Weight (grams)") + 
scale_color_discrete(name = "Top Ten\nAgricultural\nCounty", labels = c('No', "Yes")) +
  ggtitle("Birth Weight Outcome for Black Mothers") +
  ylim(2980, 3520) +
  xlim(37.6, 39.6)
white_mod <- lm(Average.Birth.Weight ~ Average.LMP.Gestational.Age + factor(agricultural) + Average.LMP.Gestational.Age*factor(agricultural), filter(mch_regression, Mothers.Race == "White"))
summary(white_mod)[4] #coefficients
## $coefficients
##                                                     Estimate Std. Error
## (Intercept)                                       -5250.8344 1098.58067
## Average.LMP.Gestational.Age                         221.3943   28.26190
## factor(agricultural)1                              6590.0755 2589.80533
## Average.LMP.Gestational.Age:factor(agricultural)1  -170.1409   66.77334
##                                                     t value     Pr(>|t|)
## (Intercept)                                       -4.779653 4.035348e-05
## Average.LMP.Gestational.Age                        7.833668 7.687165e-09
## factor(agricultural)1                              2.544622 1.613772e-02
## Average.LMP.Gestational.Age:factor(agricultural)1 -2.548036 1.600829e-02
summary(white_mod)[9] #adjusted r-squared
## $adj.r.squared
## [1] 0.6877013
q4 <- mch_regression %>%
  filter(Mothers.Race == "White" ) %>%
  ggplot(aes(Average.LMP.Gestational.Age, Average.Birth.Weight, color = factor(agricultural))) + 
  geom_point() + geom_line(aes(y = predict(white_mod)), size = 1) + 
  xlab("Average LMP Gestational Age (weeks)") + 
  ylab("Average Birth Weight (grams)") + 
scale_color_discrete(name = "Top Ten\nAgricultural\nCounty", labels = c('No', "Yes")) +
  ggtitle("Birth Weight Outcome for White Mothers") +
  ylim(2980, 3520) +
  xlim(37.6, 39.6)
#Imperial appears to be an influential point in the plot for Asian, Black, and White mothers
q1 #very small population, but the line fits well

q2 #simplest linear model still not good :()

q3 #appears to be a disparity, but Imperial my also be affecting the model

q4 #least variation in birth weights and gestational ages

Checking Assumptions for Linear Models by Checking Residuals

Our data set is small, about 50 counties total, so it was not feasible to divide the data into training and testing sets. I am checking the residuals to make sure the linear models are appropriate.

#LINE assumptions met, Black mothers in Imperial (26) appear to be influential
plot(linmod)

# residuals definitely skewed, Imperial (7) and Humboldt (6) cause problems
#linear model not appropriate at all at all :(
plot(asian_mod)

#LINE assumptions satisfied
plot(amerindian_mod)

# Imperial (7) strikes again! besides that it's good
plot(black_mod)

# Imperial (7), other than that it's good
plot(white_mod)